Prague Czech-English Dependency Treebank. Syntactically Annotated Resources for Machine Translation

Authors

  • Martin Čmejrek
  • Jan Cuřín
  • Jiří Havelka
  • Jan Hajič
  • Vladislav Kuboň
Abstract

This paper introduces the Prague Czech-English Dependency Treebank (PCEDT), a new Czech-English parallel resource suitable for experiments in structural machine translation. We describe the process of building the core parts of the resources: a bilingual syntactically annotated corpus and translation dictionaries. A part of the Penn Treebank has been translated into Czech, and the dependency annotation of the Czech translation has been produced automatically from plain text. The annotation of the Penn Treebank has been transformed into a dependency annotation scheme. A subset of corresponding Czech and English sentences has been annotated by humans. First experiments in Czech-English machine translation using these data have already been carried out. The resources, which are being created at Charles University in Prague, are scheduled for release as a Linguistic Data Consortium data collection in 2004.

Similar resources

Prague Czech-English Dependency Treebank: Any Hopes For A Common Annotation Scheme?

The Prague Czech-English Dependency Treebank (PCEDT) is a new syntactically annotated Czech-English parallel resource. The Penn Treebank has been translated into Czech, and its annotation automatically transformed into a dependency annotation scheme. The dependency annotation of Czech is done from plain text by automatic procedures. A small subset of corresponding Czech and English sentences has be...

Building a Parallel Bilingual Syntactically Annotated Corpus

This paper describes the process of building a bilingual syntactically annotated corpus, the PCEDT (Prague Czech-English Dependency Treebank). The corpus is being created at Charles University, Prague, and its release as a Linguistic Data Consortium data collection is scheduled for the spring of 2004. The paper discusses important decisions made prior to the start of the project and ...

Treebanks in Machine Translation

We present an approach using treebanks in machine translation. Our experiment in Czech-English machine translation is an attempt to develop a full machine translation system based on dependency trees (Dependency Based Machine Translation, DBMT). We use the following resources: Prague Dependency Treebank, a newly created Czech-English parallel corpus of Penn Treebank, English monolingual corpus,...

Bilingual English-Czech Valency Lexicon Linked to a Parallel Corpus

This paper presents a resource and the associated annotation process used in a project of interlinking Czech and English verbal translational equivalents based on a parallel, richly annotated dependency treebank containing also valency and semantic roles, namely the Prague Czech-English Dependency Treebank. One of the main aims of this project is to create a high-quality and relatively large em...

CzEngVallex: Mapping Valency between Languages

This report presents a guideline for building a resource connected with the project of interlinking Czech and English verbal translational equivalents, based on a parallel, richly annotated dependency treebank containing also valency and semantic roles, namely the parallel Prague Czech-English Dependency Treebank. One of the main aims of this project is to create a high-quality and relatively la...

Publication date: 2004